fix(scraper): replace misleading 403 hint for AI Scraper Studio errors #5

Open
anil-bd wants to merge 1 commit into brightdata:main from anil-bd:fix/stub-collector-hint
Conversation

anil-bd commented May 18, 2026

When `bdata scraper create` succeeds on the template POST but the
subsequent AI-trigger POST returns 429 (e.g. because the user hit the
AI Flow parallel-job cap), the half-built `collector_id` is still
printed. If that id is then passed to `bdata scraper run`, the API
returns 403 with the body
`{"error":"Collector does not have a template"}`.

Today the CLI maps any 403 to a fixed hint:

    Hint: Access denied. Check your zone permissions in the control
          panel.
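
A minimal sketch of why a status-only lookup produces this hint for every
403. `ERROR_HINTS` is the name used in this PR, and the 403 text is the
hint quoted above. The function name `status_only_hint` and the
`"Forbidden"` body are hypothetical, and the real client code may be
structured differently.

```typescript
// Assumed reduction of the existing behaviour: hints keyed by status only.
const ERROR_HINTS: Record<number, string> = {
    403: 'Access denied. Check your zone permissions in the control panel.',
};

// Hypothetical helper: the body is accepted but never inspected,
// so every 403 maps to the same hint.
function status_only_hint(status: number, _body: string): string | undefined {
    return ERROR_HINTS[status];
}

// Body reported by the API for a half-built collector (from the PR text).
const stub_collector_hint = status_only_hint(
    403,
    '{"error":"Collector does not have a template"}',
);

// Hypothetical body standing in for a genuine zone-permission failure.
const zone_hint = status_only_hint(403, '{"error":"Forbidden"}');
```

Both calls return the same zone-permission text, which is the misleading
case this PR addresses.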

This sends the user 30+ minutes down a zone-permission rabbit hole
that has nothing to do with the actual problem (the AI Flow never
finished generating selectors for this collector). Observed multiple
times during stress testing.

This change is structured so the AI Scraper Studio error vocabulary
stays in the scraper command and does NOT leak into the shared HTTP
client. `scrape`, `search`, `discover`, `pipelines`, and `browser`
are unaffected.

Mechanism:

* `src/utils/client.ts` gains a generic `hints?: Body_hint[]` field
  on `Request_opts`. The pure helper `pick_hint(status, body, hints)`
  consults the caller's list first and falls back to the existing
  `ERROR_HINTS` status-code map. The shared client ships ZERO
  command-specific patterns.

* `src/commands/scraper.ts` defines `SCRAPER_BODY_HINTS` with two
  patterns:
    - `/collector does not have a template/i` → AI generation didn't
      complete; re-run `scraper create`; web-UI URL for manual
      recovery.
    - `/cannot run more than \d+ jobs in parallel/i` → AI-Flow
      concurrent-job cap; serialise launches.

  Every `post`/`get` call in `handle_create_scraper`,
  `handle_run_scraper`, and `run_batch` passes `hints:
  SCRAPER_BODY_HINTS` so a 4xx from any of them is translated with
  the right vocabulary.

* Real zone-permission 403s (any body that doesn't match the
  scraper patterns) still get the original "Access denied" hint —
  test 'does not consult ERROR_HINTS when an extra-hint pattern
  matches' locks this in.
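
The mechanism above could be sketched roughly as follows. The names
`Body_hint`, `pick_hint`, `ERROR_HINTS`, and `SCRAPER_BODY_HINTS` and the
two regular expressions come from this description. The field names
`pattern` and `hint`, the string-typed `body`, the hint wording, and the
`"Forbidden"` body are assumptions for illustration, not the code in the
diff.

```typescript
// Assumed shape: the PR names Body_hint but does not show its fields.
type Body_hint = { pattern: RegExp; hint: string };

// Shared status-code fallback; carries no command-specific vocabulary.
const ERROR_HINTS: Record<number, string> = {
    403: 'Access denied. Check your zone permissions in the control panel.',
};

// Pure helper: the caller's body patterns win, then the status map applies.
function pick_hint(
    status: number,
    body: string,
    hints: Body_hint[] = [],
): string | undefined {
    for (const { pattern, hint } of hints) {
        if (pattern.test(body)) return hint;
    }
    return ERROR_HINTS[status];
}

// Lives with the scraper command; hint texts here are paraphrased.
const SCRAPER_BODY_HINTS: Body_hint[] = [
    {
        pattern: /collector does not have a template/i,
        hint: 'AI generation did not complete. Re-run `bdata scraper create`.',
    },
    {
        pattern: /cannot run more than \d+ jobs in parallel/i,
        hint: 'AI Flow concurrent-job cap reached. Serialise launches.',
    },
];

const stub_body = '{"error":"Collector does not have a template"}';

// Scraper call: the body pattern matches, so the scraper hint is chosen.
const scraper_hint = pick_hint(403, stub_body, SCRAPER_BODY_HINTS);

// Scraper call with an unrelated (hypothetical) body: falls back to the status map.
const zone_hint = pick_hint(403, '{"error":"Forbidden"}', SCRAPER_BODY_HINTS);

// A command that passes no hints keeps the original behaviour.
const other_command_hint = pick_hint(403, stub_body);
```

Keeping the patterns in a caller-supplied list means the fallback map only
needs to know status codes, which is how the shared client stays free of
scraper vocabulary. The patterns deliberately omit the `g` flag so that
`RegExp.prototype.test` stays stateless across calls.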

Tests: 8 unit tests for `client.pick_hint` using mock generic
patterns (covering the mechanism and asserting that the shared client
carries no scraper vocabulary in `ERROR_HINTS`), plus 5 scraper
command tests asserting that the scraper patterns are well-formed and
travel via `hints` to `client.post` on every AI-Flow call. Two
existing tests were relaxed from strict opts-object matches to
objectContaining-style matches. 58 / 58 tests in the affected files
pass. The 9 pre-existing failures in unrelated suites (daemon,
add-mcp, browser, discover, scrape) on `main` are unchanged by this
PR.
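
As an illustration of why the two existing tests needed relaxing, here is
a framework-free sketch. Everything in it is hypothetical (the `timing`
option, the endpoint string, `fake_post`, and both matcher helpers). It
uses shallow `===` comparison, whereas a real `objectContaining` matcher
compares values recursively.

```typescript
type Opts = Record<string, unknown>;

// Strict match: same number of keys and identical (===) values.
function strict_match(actual: Opts, expected: Opts): boolean {
    const actual_keys = Object.keys(actual);
    const expected_keys = Object.keys(expected);
    return actual_keys.length === expected_keys.length
        && expected_keys.every((k) => k in actual && actual[k] === expected[k]);
}

// Subset match: every expected key is present with an identical value;
// extra keys on the actual object are ignored.
function contains_match(actual: Opts, expected: Opts): boolean {
    return Object.keys(expected).every(
        (k) => k in actual && actual[k] === expected[k],
    );
}

// Placeholder list; only its identity matters for this sketch.
const SCRAPER_BODY_HINTS = [
    { pattern: /collector does not have a template/i, hint: 'placeholder' },
];

// Minimal stand-in for a mocked client.post that records its opts.
const calls: Opts[] = [];
function fake_post(_endpoint: string, opts: Opts): void {
    calls.push(opts);
}

// After this PR the command adds `hints` alongside its existing options.
fake_post('/hypothetical/endpoint', { timing: true, hints: SCRAPER_BODY_HINTS });

// An older expectation that knew nothing about `hints`.
const old_expectation: Opts = { timing: true };

const strict_passes = strict_match(calls[0], old_expectation);
const relaxed_passes = contains_match(calls[0], old_expectation);
const hints_travelled = calls[0].hints === SCRAPER_BODY_HINTS;
```

Here the strict comparison fails only because of the extra `hints` key,
while the subset comparison still verifies the options the older test
cared about.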

anil-bd force-pushed the fix/stub-collector-hint branch from a6957fa to 5157b51 on May 18, 2026 19:30